Topic: Why can the PCA algorithm on the NanoRam 785 differentiate between calcium nitrate and magnesium nitrate, but the RVM algorithm on the NanoRam 1064 cannot?
Principal component analysis (PCA) is a procedure that transforms a set of variables into a new set of variables with no collinearity, called principal components. Dimensionality reduction techniques such as PCA are often used in spectral classification and prediction because of the large number of variables present in each spectrum. Each data point in a spectrum is treated as a variable, and reducing the number of variables is crucial to building a meaningful model. PCA uses an approach called feature extraction, which creates new features from recombinations of the existing variables so that a dataset can be described adequately with a smaller number of variables.
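As a minimal sketch of feature extraction, the block below runs base R's `prcomp` on synthetic "spectra". The data, the peak positions, and the names `peak`, `centers`, and `spectra` are illustrative assumptions, not the NanoRam measurements.

```r
# Synthetic stand-in for spectra: 40 scans (rows) x 200 pixels (columns).
set.seed(1)
n_scans  <- 40
n_pixels <- 200
pixel    <- seq_len(n_pixels)

# Two hypothetical classes whose single peak sits at slightly different pixels.
peak    <- function(center) exp(-((pixel - center)^2) / (2 * 5^2))
centers <- rep(c(100, 104), each = n_scans / 2)
spectra <- t(sapply(centers, function(ctr) {
  1000 * peak(ctr) + rnorm(n_pixels, sd = 10)
}))

# Centered, unscaled PCA; each row of pca$x holds the scores of one scan.
pca <- prcomp(spectra, center = TRUE, scale. = FALSE)

# The principal component scores are uncorrelated: the off-diagonal
# covariances of the first five score columns are numerically zero.
score_cov   <- cov(pca$x[, 1:5])
max_offdiag <- max(abs(score_cov[upper.tri(score_cov)]))

dim(spectra)  # 200 original variables per scan
dim(pca$x)    # at most as many components as there are scans
max_offdiag
```

With more pixels than scans, the number of components is capped by the number of scans, which is why the summaries in this report read "out of 40" rather than "out of" the number of pixels.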
First, methods containing 20 scans each were built using the B&W Tek handheld Raman analyzer, NanoRam 785, for a single sample of both magnesium nitrate and calcium nitrate. Then the same nitrate samples were measured using a B&W Tek NanoRam 1064 handheld Raman analyzer to create a five-scan method for each sample. The raw data were processed with dark subtraction, scaling, and centering. Running PCA on the processed 785 nm spectral data yields the following:
Scree plots show the eigenvalues of the principal components. The first one depicts the raw values of the variance on the y axis and the principal components in descending order by eigenvalue magnitude:
The second plot depicts the percentage of the total variance explained by each principal component, again in descending order of eigenvalue magnitude:
The information in the preceding graphs is also displayed numerically in the table below:
## Importance of first k=5 (out of 40) components:
##                            PC1     PC2     PC3     PC4    PC5
## Standard deviation     22.4054 4.68831 3.55776 2.43046 2.2113
## Proportion of Variance  0.8626 0.03777 0.02175 0.01015 0.0084
## Cumulative Proportion   0.8626 0.90031 0.92206 0.93221 0.9406
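The rows of this table are tied together by a simple relationship: each eigenvalue is the squared standard deviation of its component, and the proportion of variance is that eigenvalue divided by the sum of all eigenvalues. The sketch below checks this against the numbers reported above. The total variance is not printed in the table, so it is inferred here from PC1, which is an assumption of this check.

```r
# Values copied from the table above.
sdev     <- c(22.4054, 4.68831, 3.55776, 2.43046, 2.2113)
reported <- c(0.8626, 0.03777, 0.02175, 0.01015, 0.0084)

# Eigenvalues are the squared standard deviations.
eigenvalues <- sdev^2

# Total variance inferred from PC1 (assumption: not shown in the table).
total_variance <- eigenvalues[1] / reported[1]

# Proportion and cumulative proportion recomputed from the eigenvalues.
prop_var <- eigenvalues / total_variance
cum_prop <- cumsum(prop_var)

round(prop_var, 5)
round(cum_prop, 4)
max(abs(prop_var - reported))  # differences are at the rounding level
```

On a fitted `prcomp` object the same quantities come from `pca$sdev^2 / sum(pca$sdev^2)`, which is what `summary()` prints.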
A 95% confidence ellipse was constructed for each class. If the underlying distribution were sampled repeatedly and a confidence ellipse calculated each time, 95% of the constructed ellipses would contain the true mean of the distribution.
The calcium nitrate samples had low within-class variance, but the magnesium nitrate samples had a few extreme outliers that significantly raised the variance and heavily affected the construction of the principal components. Furthermore, the between-class variance is fairly small: the two main clusters are right next to each other. Can we do better?
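One way to construct such an ellipse for the class mean is the Hotelling T² region, sketched below in base R. This is an assumption about the construction, since the plotting package may draw its ellipses differently. The function name `mean_conf_ellipse` and the synthetic scores are hypothetical.

```r
# Boundary of a confidence ellipse for the mean of p-dimensional scores
# (p = 2 for a PC1 vs PC2 plot), based on Hotelling's T^2 distribution.
mean_conf_ellipse <- function(scores, level = 0.95, n_points = 100) {
  n <- nrow(scores)
  p <- ncol(scores)
  center <- colMeans(scores)
  s <- cov(scores)
  # Squared "radius" of the region in Mahalanobis units.
  radius2 <- (p * (n - 1) / (n - p)) * qf(level, p, n - p) / n
  theta  <- seq(0, 2 * pi, length.out = n_points)
  circle <- rbind(cos(theta), sin(theta))
  # t(chol(s)) maps the unit circle onto the covariance ellipse.
  boundary <- t(center + sqrt(radius2) * t(chol(s)) %*% circle)
  list(center = center, cov = s, radius2 = radius2, boundary = boundary)
}

# Synthetic scores standing in for one class of 20 scans.
set.seed(6)
scores <- cbind(PC1 = rnorm(20, sd = 3), PC2 = rnorm(20, sd = 1))
ell <- mean_conf_ellipse(scores)

# Every boundary point sits at the same Mahalanobis distance from the mean.
d2 <- mahalanobis(ell$boundary, ell$center, ell$cov)
range(d2 / ell$radius2)
```

Because the radius shrinks with the number of scans, a class built from only a handful of scans gets a much wider region for the same scatter.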
Scaling before running PCA is a seemingly contentious issue in the data science and spectroscopy literature. There is a general consensus that scaling is required when variables have different units or greatly differing standard deviations. However, spectral data variables are all just relative intensities at certain pixels, which means that the variables have the same units and are all measured on the same scale. When these two conditions are met, it is acceptable and potentially even advantageous not to scale the variables and to perform the PCA on the covariance matrix. When they are not met, scaling the variables and performing the PCA on the correlation matrix makes more sense.
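The link between scaling and the choice of matrix can be checked directly. In the sketch below, unscaled `prcomp` reproduces the eigenvalues of the covariance matrix and scaled `prcomp` reproduces those of the correlation matrix. The data are synthetic, with deliberately unequal column spreads chosen for illustration.

```r
# Synthetic data: 30 observations of 8 variables with very different spreads.
set.seed(3)
x <- matrix(rnorm(30 * 8), nrow = 30) %*% diag(c(50, 20, 5, 1, 1, 1, 1, 1))

pca_cov <- prcomp(x, center = TRUE, scale. = FALSE)  # covariance-based PCA
pca_cor <- prcomp(x, center = TRUE, scale. = TRUE)   # correlation-based PCA

# Eigenvalues of the two matrices, for comparison with sdev^2.
ev_cov <- eigen(cov(x), symmetric = TRUE)$values
ev_cor <- eigen(cor(x), symmetric = TRUE)$values

all.equal(pca_cov$sdev^2, ev_cov)
all.equal(pca_cor$sdev^2, ev_cor)

# Share of total variance captured by PC1 under each choice.
prop_pc1_cov <- ev_cov[1] / sum(ev_cov)
prop_pc1_cor <- ev_cor[1] / sum(ev_cor)
c(covariance = prop_pc1_cov, correlation = prop_pc1_cor)
```

Without scaling, the high-variance columns dominate the first component. For spectra this is often desirable, because the high-variance pixels tend to be the ones carrying the peaks.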
Outliers can also have a substantial impact on the construction of principal components. From part one, it was immediately obvious that an outlier in the magnesium nitrate group was skewing the results of the PCA. Examining the spectra graphically in the spectroscopy software BWIQ shows that magnesium nitrate scan #12 is vastly different from the other magnesium nitrate scans, likely because of human error in taking the scan. Two other potential outliers appear in the PCA graph from part one, but spectrally they are not dissimilar to the other samples in the method, and after running PCA again with the largest outlier removed, those two scans are no longer problematic.
As before, methods containing 20 scans each were built using the B&W Tek handheld Raman analyzer, NanoRam 785, for a single sample of both magnesium nitrate and calcium nitrate. This time the raw data were processed with dark subtraction and centering only, with no scaling. Magnesium nitrate scan #12 was removed from the dataset as an outlier. Running PCA on the processed 785 nm spectral data yields the following:
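The outlier here was identified by eye, but a rough automated screen is possible. The sketch below compares each scan with the class median spectrum and flags scans whose similarity is far below the rest. The data are synthetic, the corrupted scan only mimics a bad acquisition, and the cutoff of -6 is an arbitrary, deliberately strict assumption.

```r
# Synthetic class of 20 scans x 200 pixels sharing one peak.
set.seed(4)
pixel <- 1:200
clean <- 1000 * exp(-((pixel - 100)^2) / (2 * 5^2))
scans <- t(replicate(20, clean + rnorm(200, sd = 10)))

# Corrupt scan 12 to mimic a bad acquisition (not the real scan #12).
scans[12, ] <- 300 * exp(-((pixel - 60)^2) / (2 * 20^2)) + rnorm(200, sd = 10)

# Similarity of each scan to the pixel-wise median spectrum of the class.
median_spectrum <- apply(scans, 2, median)
similarity <- apply(scans, 1, function(s) cor(s, median_spectrum))

# Robust z-score: median and MAD are barely affected by the outlier itself.
z <- (similarity - median(similarity)) / mad(similarity)
flagged <- which(z < -6)
flagged
```

Using the median spectrum rather than the mean keeps the reference from being dragged toward the bad scan.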
## Importance of first k=5 (out of 39) components:
##                              PC1       PC2       PC3       PC4       PC5
## Standard deviation     7784.6314 4094.0216 1.466e+03 730.90832 621.10275
## Proportion of Variance    0.7284    0.2015 2.582e-02   0.00642   0.00464
## Cumulative Proportion     0.7284    0.9299 9.557e-01   0.96215   0.96678
Reduced Variable Multivariate (RVM) is another dimensionality reduction procedure, which deletes variables from the original set until only a small subset that accurately describes the dataset remains. RVM uses an approach called feature selection, which differs from feature extraction in a couple of ways. First, since variables are simply removed instead of being transformed into new ones, feature selection discards data, whereas feature extraction builds its new variables from all of the original data. Second, feature selection gives no guarantee that the remaining variables are free of collinearity, while feature extraction with PCA guarantees orthogonal variables with no collinearity. Before the data collected on the 1064 nm device are analyzed with RVM, they will first be run through the PCA model as a baseline for comparison.
There don’t seem to be any outliers present, and, following the lesson of the previous sections, the variables are not scaled in the preprocessing steps.
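The collinearity difference between the two approaches can be illustrated with a toy dataset. The selection rule used here (keep the three highest-variance variables) is purely a stand-in and is not the RVM algorithm, whose selection rule is not known. The data and variable names are synthetic.

```r
# Ten synthetic variables; the first three share one latent signal,
# so they are strongly collinear with each other.
set.seed(5)
n <- 40
latent <- rnorm(n)
x <- cbind(
  sapply(1:3, function(i) 10 * latent + rnorm(n)),
  sapply(1:7, function(i) rnorm(n))
)
colnames(x) <- paste0("v", 1:10)

# Feature selection (illustrative rule): keep the 3 highest-variance columns.
keep     <- order(apply(x, 2, var), decreasing = TRUE)[1:3]
selected <- x[, keep]

# Feature extraction: scores on the first 3 principal components.
extracted <- prcomp(x, center = TRUE, scale. = FALSE)$x[, 1:3]

# Largest absolute pairwise correlation within a set of variables.
max_abs_cor <- function(m) {
  r <- cor(m)
  max(abs(r[upper.tri(r)]))
}

colnames(selected)
max_abs_cor(selected)   # close to 1: the kept variables are collinear
max_abs_cor(extracted)  # numerically zero: the scores are uncorrelated
```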
## Importance of first k=3 (out of 12) components:
##                              PC1       PC2       PC3
## Standard deviation     3.019e+04 1.356e+04 3.713e+03
## Proportion of Variance 8.159e-01 1.645e-01 1.234e-02
## Cumulative Proportion  8.159e-01 9.803e-01 9.927e-01
Note: the confidence ellipses are currently not being drawn on this plot; the error still needs to be fixed.
The PCA algorithm has no trouble distinguishing between the two kinds of nitrates on the data from the 1064 nm device either. This points to a flaw in the RVM algorithm, since the RVM method had specificity issues when examining the nitrate samples on the 1064 nm device. Further investigation will be conducted to uncover the issues the 1064 nm device is having with its current algorithm.
I’m eventually going to try to build an RVM algorithm from scratch, since I won’t be able to get any source code. All I have to go on is that RVM uses feature selection, so there is probably collinearity between the remaining variables, and that a variance inflation factor is applied, which necessarily decreases specificity (more false positives). There appears to be no way to alter the default inflation factor of 5 on the NanoRam 1064, although the internal paper mentions that it can be lowered to deal with false positive issues. For now, I will discuss the potential drawbacks of an RVM method and what I believe to be the root cause of the specificity issues.
The number of false negatives using RVM is quite low. The main issue is false positives, i.e., a sample will come up as a positive match for two different methods. As mentioned in the technical paper, a variance inflation factor is used to increase the variance of each spectrally reduced variable in the RVM method. The claim is that this will reduce the number of false negatives, which of course is true. However, the inflation factor will also inherently increase the number of false positives. As noted above, the technical paper states that lowering this inflation factor will increase specificity for distinguishing spectrally similar compounds, but apparently there is no option to do this on the actual device.
An interesting phenomenon sometimes occurs when comparing two spectrally similar compounds. The method for sample A will work perfectly well, but the method for sample B will not be able to tell A and B apart, giving false positives for scans of sample A. Why does this occur? The underlying cause could be that the spread of the data differs in magnitude between the two methods. A few visual aids may shed some light on the problem. Notice how the confidence region is artificially boosted for these two distributions:
For well-separated spectra like those above, the confidence regions are not very close together, so false positives are unlikely. However, if the confidence regions are closer together…
…this can cause issues. Notice the disparity in the area of the ellipses: the (blue) calcium nitrate data has a much smaller variance than the (orange) magnesium nitrate data, and thus its ellipse has significantly less area. Why does this matter? My intuitive guess is that when samples of calcium nitrate and magnesium nitrate are run against the calcium nitrate method, the variance multiplication does not increase the area of the blue calcium nitrate ellipse enough to “cross” into the threshold of the orange magnesium nitrate ellipse. This means that the method will be able to properly differentiate between the two samples. However, when the magnesium nitrate method is used, the magnesium nitrate ellipse is much larger to start with, so when its size is scaled up by the inflation factor, it practically devours the calcium nitrate ellipse, leading to a false positive scan against the method. Note that the example uses data from the 785 device, with 20 scans per method; variance problems will be even more pronounced on the 1064 device because its methods are built from only five scans. Without knowing explicitly how the RVM method works, this is only conjecture, but it gives a conceptual framework for understanding the crux of the issue. The p-value crosstable provides some support for this idea: on the 1064 scans it is the calcium nitrate method that has the false positive issue, and that method also has the much higher variance:
More investigation is required to determine whether this hypothesis is on the right track. Examination of other data sets is underway, as is the rogue RVM development. Stay tuned!
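The asymmetry described above can be reproduced with a toy calculation. The sketch assumes a hypothetical acceptance rule: a scan matches a method when its squared Mahalanobis distance to the method mean, computed with the inflated covariance, falls under a chi-square cutoff. This is not the NanoRam's actual rule, and the means and covariances are made-up numbers chosen to mimic one tight class and one wide class.

```r
# Hypothetical acceptance rule (assumption, not the device's algorithm):
# the inflation factor multiplies the method's covariance matrix.
accepts <- function(scan, method_mean, method_cov, inflation = 1, level = 0.95) {
  d2 <- mahalanobis(scan, center = method_mean, cov = inflation * method_cov)
  unname(d2 <= qchisq(level, df = length(method_mean)))
}

# Made-up 2-D class summaries: a tight class and a wide class, 6 units apart.
tight <- list(mean = c(0, 0), cov = diag(0.25, 2))  # low-variance method
wide  <- list(mean = c(6, 0), cov = diag(4, 2))     # high-variance method

# A typical scan of each class, taken at the class mean for simplicity.
tight_scan <- tight$mean
wide_scan  <- wide$mean

# Tight method with inflation 5: the wide-class scan is still rejected.
wide_vs_tight_method <- accepts(wide_scan, tight$mean, tight$cov, inflation = 5)

# Wide method: rejected without inflation, accepted once inflated.
tight_vs_wide_method_raw      <- accepts(tight_scan, wide$mean, wide$cov, inflation = 1)
tight_vs_wide_method_inflated <- accepts(tight_scan, wide$mean, wide$cov, inflation = 5)

c(wide_vs_tight_method, tight_vs_wide_method_raw, tight_vs_wide_method_inflated)
```

Under these assumptions only the high-variance method produces a false positive, and only after inflation, which matches the one-directional failure described above.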
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] factoextra_1.0.6 ggfortify_0.4.8 data.table_1.12.2
## [4] plyr_1.8.4 forcats_0.4.0 stringr_1.4.0
## [7] dplyr_0.8.3 purrr_0.3.2 readr_1.3.1
## [10] tidyr_1.0.0 tibble_2.1.3 tidyverse_1.2.1
## [13] plotly_4.9.1 ggplot2_3.2.1
##
## loaded via a namespace (and not attached):
## [1] ggrepel_0.8.1 Rcpp_1.0.2 lubridate_1.7.4
## [4] lattice_0.20-38 assertthat_0.2.1 zeallot_0.1.0
## [7] digest_0.6.21 mime_0.7 R6_2.4.0
## [10] cellranger_1.1.0 backports_1.1.4 evaluate_0.14
## [13] httr_1.4.1 pillar_1.4.2 rlang_0.4.0
## [16] lazyeval_0.2.2 readxl_1.3.1 rstudioapi_0.10
## [19] rmarkdown_1.15 labeling_0.3 htmlwidgets_1.3
## [22] munsell_0.5.0 shiny_1.3.2 broom_0.5.2
## [25] compiler_3.6.1 httpuv_1.5.2 modelr_0.1.5
## [28] xfun_0.9 pkgconfig_2.0.3 htmltools_0.3.6
## [31] tidyselect_0.2.5 gridExtra_2.3 viridisLite_0.3.0
## [34] crayon_1.3.4 withr_2.1.2 ggpubr_0.2.4
## [37] later_0.8.0 grid_3.6.1 xtable_1.8-4
## [40] nlme_3.1-140 jsonlite_1.6 gtable_0.3.0
## [43] lifecycle_0.1.0 magrittr_1.5 scales_1.0.0
## [46] cli_1.1.0 stringi_1.4.3 ggsignif_0.6.0
## [49] promises_1.0.1 xml2_1.2.2 generics_0.0.2
## [52] vctrs_0.2.0 tools_3.6.1 glue_1.3.1
## [55] hms_0.5.1 crosstalk_1.0.0 yaml_2.2.0
## [58] colorspace_1.4-1 rvest_0.3.4 knitr_1.25
## [61] haven_2.1.1